Abstract

An increasing number of high-performance distributed systems
are written in garbage collected languages. This removes a
large class of harmful bugs from these systems. However, it
also introduces high tail-latency due to garbage collection
pause times. We address this problem through a new technique
of garbage collection avoidance which we call Blade. Blade is
an API between the collector and application developer that
allows developers to leverage existing failure recovery
mechanisms in distributed systems to coordinate collection and
bound the latency impact. We describe Blade and implement it
for the Go programming language. We also investigate two
different systems that utilize Blade, an HTTP load-balancer and
the Raft consensus algorithm. For the load-balancer, we
eliminate any latency introduced by the garbage collector; for
Raft, we bound the latency impact to a single network
round-trip (48 µs in our setup). In both cases, latency at the
tail using Blade is up to three orders of magnitude better.

1. Introduction

Figure 1: CDF of request latency for a ZooKeeper-like [29]
replicated key-value store using the Raft [44] consensus
algorithm written in Go. The system was configured with 3 nodes
and 10 clients generating a total of 250 requests-per-second
(3:1 get/set ratio) over 10 minutes. A parallel, stop-the-world
(STW) mark-sweep collector was used, with a heap size of 500 MB
for a 200 MB working set.


Recently, there has been an increasing push for low-latency
at the tail in distributed systems [18, 45, 54]. This has arisen
from the needs of modern data center applications which consist
of hundreds of software services, deployed across thousands of
machines. For example, a single Facebook page load can involve
fetching hundreds of results from their distributed caching
layer [42], while a Bing search consists of 15 stages and
involves thousands of servers in some of them [31]. These
applications require latency in microseconds with tight tail
guarantees.

Recent work addressed this at the operating system and
networking layer [3, 9, 32, 48]. However, this is only half the
picture. Increasingly, application developers choose to build
distributed systems in garbage collected languages. For
example, a large number of distributed systems are written in
Java [29, 58, 65] and Go [21, 24-26]. Garbage collected
languages are attractive because manual memory management is
extremely bug-prone [13, 17].

Unfortunately, garbage collection introduces high
tail-latencies due to long pause times. For example, Figure 1
shows the impact of garbage collection on the tail-latency of
one such distributed system. While many distributed systems
require average and tail-latencies in microseconds, garbage
collection pause times can range from milliseconds for small
workloads to seconds for large workloads.

Moreover, dealing with pause times in the application is
hard. The impact of garbage collection is often unpredictable
during development and difficult to debug once deployed.
First, from the programmer’s perspective, garbage collection
can occur at any point during execution. Second, performance
can vary greatly from system to system, or even over the
lifetime of a single system [20, 59]. Finally, tuning the
garbage collector of a deployed system is hard because
performance is workload dependent. As a result, users must
continually adjust run-time system parameters (e.g., generation
sizes) based on production workloads.

None of the current approaches to garbage collection are
suitable for this new set of requirements where the 99.9th
percentile matters. On one side, language implementers attempt
to build faster collectors [23, 27, 51, 61]. However, they are
generally concerned with average case behaviour and optimising
across a large set of use-cases [11]. As such, pause times at
the tail are still too long. Moreover, the effort to build
better collectors must be replicated for each language
runtime. On the other side, developers deploying such systems
in production may turn off the garbage collector altogether¹,
or switch to manual memory management, giving up the
productivity gains of memory-safe languages [49].

We propose a new approach to building distributed systems
in garbage collected languages, called Blade, that gives
control over tail-latency back to the programmer. Instead of
attempting to minimise pause times, distributed systems should
treat pause times as frequent but predictable failures. Blade
is an interface to the run-time system that allows programmers
to participate in the decision to pause for collection,
customising the collection policy to their system. Blade’s
simple API allows systems builders to:

1. eliminate garbage collection related latency

2. by leveraging system-specific failure recovery mechanisms
to mask pause times,

3. and model the performance impact of garbage collection
without knowledge of the production workload.

In this paper, we describe and evaluate the Blade API. 
We implemented Blade for the Go programming language 
and used it to eliminate garbage collection related tail-latency 
in two different distributed systems. The first system is a 
cluster of web application servers behind a load-balancer, and 
the second is the Raft [44] consensus algorithm. 

We compare Blade in both systems against the default Go
garbage collector, and against the performance-optimal case of
no garbage collection at all. For the HTTP cluster, Blade
completely eliminates any latency impact on requests caused by
garbage collection, while for the Raft consensus algorithm, it
bounds the latency impact to a single extra network RTT (48 µs
in our experimental setup). In end-to-end tests, this matches
the performance of the optimal system with no GC.

The rest of the paper is organized as follows. In Section 2,
we motivate the problem and explain why existing solutions
do not work. In Section 3 we outline Blade. In Section 4 we
explore two end-to-end distributed systems that use Blade
and in Section 5 we evaluate both systems. In Section 6 we
discuss the results and limitations, while in Section 7 we
describe related work. Finally, we conclude in Section 8.

¹ Generational garbage collectors often allow users only to
disable collection for the old generation, but this just serves
to delay, not eliminate, memory exhaustion.

2. Background 

2.1 Data Center Performance Today 

The performance demands of applications running in data
centers are changing significantly. To enable rich interactions
between services without impacting the overall latency
experienced by users, average latencies must be in the few tens
or low hundreds of microseconds [8, 54]. Because a single
user request may touch hundreds of servers, the long tail of
the latency distribution must also be considered [18, 31], with
each service node ideally providing tight bounds on even the
99.9th percentile request latency.

Today, most commercial Memcached deployments provision each
server so that the 99th percentile latency does not exceed
500 µs [35]. Recent academic results such as the IX operating
system can run Memcached with 99th percentile latencies of
under 100 µs at peak [9]. The MICA key-value store can achieve
70 million requests-per-second with tail-latencies of 43 µs
[37]. Current research projects such as RAMCloud [45, 47] are
targeting 10 µs or lower RPC latencies.

2.2 State of Garbage Collection 

We give a brief overview of garbage collection and the
tradeoffs for the main approaches. Garbage collectors deal with
two major concerns: finding and recovering unused memory,
and dealing with heap fragmentation, often by relocating live
objects. We will look at four collector designs: stop-the-world
(STW), concurrent, real-time and reference counting.

Stop-the-world Stop-the-world collectors are the oldest,
simplest and highest throughput collectors available [22]. An
STW GC works by first completely stopping the application and
then, starting from a root set of pointers (registers, stacks,
global variables), tracing out the application’s live set.
Objects are either marked as live, or relocated to deal with
fragmentation. Next, the application can be resumed and any
unmarked objects added to the free list.

Their simplicity and high throughput make them common.
For example, Go, Ruby, and Oracle’s JVM (by default) use
STW collectors. The downside is that pause times are
proportional to the number of live pointers in the heap. As a
result, state-of-the-art STW collectors can have pause times of
10^0 ms per GB of heap [22].

Concurrent Concurrent collectors attempt to reduce the
pause time caused by STW collectors by enabling the GC
to run concurrently with application threads. They achieve
this by using techniques such as read and write barriers to
detect and fix concurrent modifications to the heap while
tracing live data and/or relocating objects. For example, a
common approach to concurrent tracing is to use write barriers,
either through inline code or virtual memory protection,
whereby any modification to the heap will enter a slow path
handler that adds the pointer to the list of pointers to trace
[23, 27, 64]. For handling concurrent relocation of objects to
reduce fragmentation, either Brooks-style read barriers [12]
are used, where all pointer dereferences check to see if the
object has been replaced with a forwarding pointer pointing to
the object’s new location, or direct access barriers
[7, 15, 23], where a read barrier is used to fix up any pointer
to point to the new location as the pointer is read from the
heap.

Because of these techniques, pause times for the best
concurrent collectors are measured in the few milliseconds
[4, 23, 30]. However, concurrent collectors have lower
throughput, higher implementation complexity and edge cases
that still require GC pauses. First, concurrent collectors
reduce application throughput by 10-40% and increase memory
usage by 20% compared to STW collectors [19, 22, 64]. This
is due to the overhead of handling barriers, forwarding
pointers and synchronization between the GC threads and the
application. Second, most concurrent collectors have corner
cases that trigger long pauses. For example, STW pauses are
often used to start or end collector phases [27, 64], the
amount of work that can occur in a slow-path for an allocation
or barrier is variable and often unbounded [27, 50], and high
allocation rates can cause the application to outpace the
collector and pause [51]. Finally, concurrent collectors are
incredibly complex. Oracle’s JVM, for example, has two
concurrent collectors, CMS and G1, but both have pause times in
the hundreds of milliseconds due to significant stages of their
collection cycle being STW.

Real-time Garbage collectors designed for real-time systems
take the approaches of concurrent collectors even further,
many offering the ability to bound pause times. The best
collectors can achieve bounds in the tens of microseconds
[50, 51]. However, doing so comes at a high throughput cost,
ranging from 30%-100% overheads, and generally increased heap
sizes of around 20% [50, 51]. This is due to techniques such as
fragmented allocation [6], used to avoid the recompacting stage
taken by most non-real-time concurrent collectors to handle
fragmentation. Fragmented allocation allocates all objects in
small, fixed-size chunks, breaking up logical objects larger
than the chunk size. The extra indirection can greatly impact
system performance.

Reference counting A completely different approach to a
tracing garbage collector is reference counting. Each object
has an integer attached to it to count the number of incoming
references, which, once it reaches zero, indicates the object
can be freed. Its largely predictable behaviour and simple
implementation make it common; for example, Python,
Objective-C and Swift all use reference counting.

In general, reference counting greatly improves pause
times since there is no background thread for collection;
instead, reclaiming memory is an incremental and localised
operation. However, three problems emerge: lower throughput,
free-chains and cycles. First, reference counting suffers
from poor throughput due to the need for atomic increments
and decrements on pointer modifications. On average, reference
counting has 30% lower throughput compared to tracing
collectors [10, 56]. Recent work has improved this to be
competitive [56, 57] but does so by incorporating techniques
from tracing collectors, which brings back pauses. Second,
reference counting collectors can suffer from long pauses on
free operations when doing so causes a long chain of decrements
and frees to other objects in the heap. Third, and finally,
reference counting suffers from its inability to collect cyclic
data structures. This is solved by either complicating the
interface to the developer and asking them to break cycles, or
by including a backup tracing collector to collect cycles
periodically [56]. Python, for example, takes this approach.

3. Design 

Blade is an interface to the run-time system (RTS) of the
language that allows programmers to participate in the decision
to pause for collection, letting them customize the collection
policy to their system. Blade is not a new approach to garbage
collection, but a new approach to dealing with its performance
impact in distributed systems.

Table 1 summarises the API for Blade, which consists of
three simple functions. The regGCHand(handler) function simply
sets up a function as a target for an upcall from the RTS. The
startGC(id) function starts the collector, passing in an id
number previously given to the application through an upcall.
The id argument increases monotonically across upcalls and
serves to make the function idempotent. Finally, the upcall
function gcHand(id, allocated, pause) is invoked by the RTS at
the start of every collection and allows the application
developer to decide if the collection should occur immediately
or be delayed. The id argument identifies this collection
event, while the allocated argument indicates the current heap
size. Finally, the pause argument gives an estimate by the
collector of the time that this collection will take. The
function can either return true to indicate that the RTS should
immediately perform the collection, or return false to delay
the collection until startGC is called.

This API is simple enough that most garbage collected 
languages can implement it in a hundred lines of code or so. 
For example, it took 112 lines of code to implement Blade 
for the Go programming language. 

Table 1: Blade API

Operation                            Description
void regGCHand(handler)              Set function to be called on GC
bool gcHand(id, allocated, pause)    Upcall to schedule GC event
void startGC(id)                     Start the GC

While there are a few different design choices for the API,
we decided on this one as it is minimal and easily supported
by languages, yet expressive enough for supporting our
end-to-end systems. The parameters passed through to the gcHand
function are where we had the most choice, and indeed the right
choice here will likely vary slightly from language to
language. For example, in Java, a third parameter of the amount
of heap remaining would be appropriate, but our target language
of Go doesn’t support any notion of bounding the heap size. The
purpose of the arguments to gcHand is to allow the application
developer to make appropriate policy decisions on when to
collect. This will generally be a binary choice of collecting
now, or deferring collection until appropriate failure recovery
actions have been taken to minimize the latency impact. The
right decision for the application will depend on the expected
latency impact of the collection, as short collections do not
make sense to coordinate globally. The estimated pause time in
our current Go implementation is derived by simple linear
extrapolation from previous collection pause times at different
heap sizes.

As Blade allows delaying collection, the RTS must decide
both when to make the upcall to the application and what to do
if memory is exhausted before the collection is scheduled. For
the first situation we add a configurable low-water-mark
parameter to the RTS to allow specifying how much room for
delay should be left when upcalling into the application. For
the exhaustion situation, we simply have the collector run
immediately. Any future call to startGC(id) with that
collection event’s id number will be ignored. This retains
safety and simply reduces performance in the worst case to
that of a system without Blade. We initially tried adding a
second upcall from the RTS to the application to notify it when
this timeout occurred, but found that it was both complex to
handle and generally of little benefit. Given that this is also
expected to be rare, making startGC(id) idempotent resulted in
what we believe to be a stronger design.

4. Blade Systems 

In this section, we apply Blade to two different end-to-end
distributed systems. First, we look at the simplest case for
Blade, a cluster of stateless HTTP servers behind a
load-balancer. Next, we look at the Raft [44] consensus
algorithm.

4.1 HTTP Proxy: No shared state

The most natural application domain for Blade is a fully
replicated service where any server can service a request.
Here we consider a load-balanced HTTP service where a
single coordinating load-balancer proxies client requests to
many backend servers. Typically, all backend servers are
identical and the load-balancer uses simple round-robin to
schedule requests. The load-balancer can also detect when
backend servers fail by imposing a timeout on requests.
However, since some HTTP requests might take a while to
service, the load-balancer cannot easily distinguish between a
misbehaving server servicing a fast request, and a properly
behaving server servicing a slow request. As a result, timeouts
are typically set high - for example, in the NGINX web
server [2], the default timeout is 60 seconds.

The HTTP load-balanced distributed system has a few
unique properties. First, each request can be routed to any
of the replicas. Second, any mutable state is either stored
externally (e.g., in a shared SQL database) or is not relevant
for servicing client requests (e.g., performance metrics).
Third, the HTTP load-balancer acts as a single, centralized
coordinator for all requests². These three properties make
Blade easy to utilize.

The approach is to have an HTTP server explicitly notify
the load-balancer when it needs to perform a collection, and
then wait for the load-balancer to schedule it. Once the
collection has been scheduled, the load-balancer will not send
any new requests to the HTTP server, and the HTTP server will
finish any outstanding requests. Once all requests are drained,
it can start the collection, and once finished, notify the
load-balancer and begin receiving new requests. In most
situations, the load-balancer will schedule an HTTP server to
collect immediately. However, it may decide to delay the
collection if a critical number of other HTTP servers are
currently down for collection. This allows the load-balancer to
make decisions with throughput impacts in mind. Figure 2 shows
the pseudocode for how a backend server uses Blade. One
subtlety is that, when deciding to defer a collection, the
handler starts a new thread (a cheap operation in Go) to
coordinate it. The thread that invoked the callback is an
ordinary application thread that just tried to allocate, so it
may be holding locks.
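
The load-balancer side of this protocol can be sketched as a small piece of bookkeeping. The Coordinator type and its methods below are hypothetical names chosen for this example, and the policy shown (a fixed cap on how many backends may be out of rotation at once, with FIFO queuing of the rest) is an assumption; the text above only requires that collection be delayed when too many servers are already down.

```go
package main

import "fmt"

// Coordinator is a hypothetical GC scheduler that sits beside the
// load-balancer and decides which backends may drain and collect.
type Coordinator struct {
	maxDown    int             // most backends allowed to collect at once
	collecting map[string]bool // backends currently out of rotation
	waiting    []string        // backends queued for permission, FIFO
}

func NewCoordinator(maxDown int) *Coordinator {
	return &Coordinator{maxDown: maxDown, collecting: map[string]bool{}}
}

// AskGC handles an askGC request. It reports whether the backend may
// drain and collect now; otherwise the request is queued.
func (c *Coordinator) AskGC(server string) bool {
	if c.collecting[server] {
		return true // duplicate request for an already granted GC
	}
	if len(c.collecting) < c.maxDown {
		c.collecting[server] = true
		return true
	}
	for _, w := range c.waiting {
		if w == server {
			return false // already queued
		}
	}
	c.waiting = append(c.waiting, server)
	return false
}

// DoneGC handles a doneGC notification. It returns the next backend
// granted permission, or "" if nobody is waiting.
func (c *Coordinator) DoneGC(server string) string {
	delete(c.collecting, server)
	if len(c.waiting) == 0 || len(c.collecting) >= c.maxDown {
		return ""
	}
	next := c.waiting[0]
	c.waiting = c.waiting[1:]
	c.collecting[next] = true
	return next
}

// InRotation reports whether new requests may be routed to server.
func (c *Coordinator) InRotation(server string) bool {
	return !c.collecting[server]
}

func main() {
	c := NewCoordinator(1)
	fmt.Println(c.AskGC("web1"))  // nobody else is collecting
	fmt.Println(c.AskGC("web2"))  // web1 is already out of rotation
	fmt.Println(c.DoneGC("web1")) // web1 returns, the queued backend is next
}
```

With maxDown set to 1 this matches the policy used in our evaluation, where only one web application server is ever collecting at a time.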

4.1.1 HTTP: Performance 

Using Blade with the HTTP cluster allows us to trade capacity
for better latency; as such, no request should ever block
waiting for the garbage collector. We can model this formally
to investigate the impact of a GC event on the system. We
break down the stages involved at a single HTTP backend for
performing a garbage collection using Blade; this can be
seen in Figure 3. It consists of $T_{schedule}$, the time to
both request and be scheduled to GC by the load-balancer,
$T_{trailers}$, the time for the HTTP server to service any
outstanding requests, $T_{gc}$, the time to perform the garbage
collection, and $T_{rpc}$, the time to send an RPC notifying
the load-balancer the GC is finished. This gives us the
following model:

HTTP Cluster GC Model

$$
\begin{aligned}
\mathrm{LatencyImpact} &= 0 \\
\mathrm{CapacityLoss} &= 1\ \text{server} \\
\mathrm{CapacityDowntime} &= T_{trailers} + T_{gc} + T_{notify} \\
\mathrm{EventTime} &= T_{schedule} + \mathrm{CapacityDowntime}
\end{aligned}
$$

func bladeGC(id)
    rpc(controller, askGC)
    waitTrailers()
    startGC(id)
    rpc(controller, doneGC)

func handGC(id, allocd, pause) bool
    if threshold(allocd, pause)
        return true
    else
        // start in new thread
        go bladeGC(id)
        return false

Figure 2: Pseudocode (Go) for HTTP Cluster using Blade.

² Some deployments have multiple HTTP load-balancers, themselves
load-balanced with DNS or IP load-balancing; however, commonly
each load-balancer in this case manages a separate cluster
anyway to make more effective load-balancing decisions.

In general, we expect $T_{schedule}$ to be 1 network
round-trip-time (RTT), while $T_{notify}$ (the $T_{rpc}$ stage
of Figure 3) should be ½ the network RTT. The value of
$T_{trailers}$ is application specific, but importantly, is
a term expressed in units that the application developer is
intimately familiar with.

The latency impact of zero is of course only true when the
current throughput demand on the cluster is low enough to be
satisfied by the remaining servers without queuing. However,
even when this isn’t the case, because the load-balancer
spreads all requests evenly over the remaining servers, no
individual request experiences a disproportionate latency
impact. Without Blade, the latency impact of garbage collection
on requests would be the length of the GC pause, $T_{gc}$,
potentially far longer than 0. On the downside, using Blade
does extend the duration of the capacity downtime by
$T_{trailers} + T_{notify}$, which has a lower bound of half
the RTT.

Importantly, this model shows how Blade allows developers
to achieve the three goals we started with: bounding latency,
doing so using failure recovery mechanisms present in the
system, and modeling the performance impact of garbage
collection on the system. For an HTTP cluster, Blade bounds
latency to 0 by using the load-balancer and allows us to model
this without concern for workload, heap size or the underlying
garbage collection algorithm.

4.2 Raft: Strongly consistent replication 

In the HTTP load-balancer, because there is no shared mutable
state at the server, any server can service any request and,
as a result, we can treat garbage collection events as
temporary failures. The same is true when mutable state is
consistently shared between all servers, for example, as in a
Paxos-like [34] system that uses a consensus algorithm for
strongly consistent replication. In this section we consider
how to use Blade for the Raft [44] consensus algorithm.

Figure 3: Capacity of an HTTP server over time during a garbage
collection cycle with Blade. [Plot: capacity (100% at full
service) against time, divided into the stages $T_{schedule}$,
$T_{trailers}$, $T_{gc}$ and $T_{rpc}$ by the events "can gc?",
"gc ok", "start gc", "end gc" and "gc done".]

In Raft, during steady-state, all write requests flow through
a single server referred to as the ‘leader’. Other servers run
as ‘followers’. Writes are committed within a single round-trip
to a majority of the other servers, leading to sub-millisecond
writes in the common case³. Garbage collection pauses can
hurt cluster performance in two cases. First, when the leader
pauses, all requests must wait to be serviced until GC is
complete. If the GC pause time exceeds the leader timeout
(typically 150 ms), the remaining servers will elect a new
leader before GC completes. Second, if a majority of the
servers are paused for GC, no progress can be made until a
majority are live again. The second case is worse, because if
garbage collection pauses are very long, there is no built-in
way for the system to make progress during this time. The
probability of this occurring is higher than expected, as
memory consumption will be roughly synchronized across servers
because of the replicated state machine each one is executing.

³ In a low-latency network topology and with persistent storage
(such as flash drives).

// run in own thread
func bladeClient()
    reqInFlight := 0
    forever
        select
        case id := <-gcRequest:
            reqInFlight = id
            rpc(leader, askGC)

        case id := <-gcAllowed:
            reqInFlight = 0
            startGC(id)
            rpc(leader, doneGC)

        case leader := <-leaderChange:
            if reqInFlight != 0
                rpc(leader, askGC)

func handGC(id, allocd, pause) bool
    if threshold(allocd, pause)
        return true
    else
        // start in new thread
        go func() { gcRequest <- id }()
        return false

Figure 4: Pseudocode (Go) for Raft server, when functioning as a
follower and not a leader, using Blade. The <- symbols represent
message passing between threads using channels.

We use Blade with Raft as follows. First, when a follower
needs to GC, we follow a protocol similar to the HTTP
load-balanced cluster. The follower notifies the leader of its
intention to GC and waits to be scheduled. The leader schedules
the collection as long as doing so will leave enough servers
running for a majority to be formed and progress made. We only
consider servers offline due to GC for this, as servers down
for other reasons could be down for an arbitrary amount of
time. The leader must also time out servers considered down for
garbage collection to prevent blocking, marking their GC as
completed, in the rare event that they become unavailable
during a collection. One important difference with Raft
compared to the HTTP load-balancer is that the leader doesn’t
need to stop sending requests to followers while they are
collecting, and neither do followers need to wait to finish any
outstanding requests. As Raft is designed to make progress with
servers unavailable, we can rely on this and have a follower
proceed with GC. We present the pseudocode for the follower
situation in Figure 4, including the retry logic for sending
requests to the new leader if it changes over the course of a
collection request. The code makes use of channels, a message
passing mechanism provided in Go.

The second situation, when a server is acting as leader for
the cluster, is more interesting. Since the cluster cannot make
progress when the leader is unavailable, we switch leaders
before collecting. Once the leadership has been transferred,
the old leader (now a follower) runs the same algorithm as
presented previously for followers in Figure 4. A leadership
switch like this can be done in just ½ the RTT of the network
by having the current leader send a broadcast to all servers
in the cluster notifying them of the new leader [43]. The
current leader may need to delay switching leadership until it
knows that the next chosen leader is up-to-date, but during
this time, the cluster can continue servicing requests. We
present the pseudocode for the leader situation in Figure 5. In
this design the current leader chooses the last server that
collected to be the next leader, or a random server if this
information isn’t known. Since the current leader acts as the
coordinator for garbage collection, it also keeps track of how
many servers are currently collecting, and queues requests for
future collections from servers that cannot be scheduled
immediately.

// run in own thread
func bladeLeader()
    used := 0
    pending := queue.New()
    lastGC := cluster.RandomServer()
    forever
        select
        case from := <-gcRequest:
            if used + 1 > quorum
                pending.End(from)
            else
                used++
                if from == myID
                    switchLeader(lastGC)
                else
                    rpc(from, allowGC)

        case from := <-gcFinished:
            lastGC = from
            if pending.Len() == 0
                used--
            else
                from = pending.Front()
                if from == myID
                    switchLeader(lastGC)
                else
                    rpc(from, allowGC)

        case leader := <-leaderChange:
            used = 0
            pending.Clear()

Figure 5: Pseudocode (Go) for Raft server, when functioning as a
leader, using Blade.

Finally, outstanding client requests at the old leader must 
be handled. One method is to notify clients of the new leader 
and have them retry. This is simple but incurs more latency 
than required. Instead, the old leader can act as a proxy for 
these requests, forwarding them to the new leader in the same 
RPC as the election switch message. The new leader can 
either reply to clients through the old leader, or directly to 
them, depending on the client design. 

4.2.1 Raft: Performance 

As before with the HTTP cluster, we can model the performance
impact on Raft of a GC event when using Blade. First, we model
the impact when a follower collects, and second, when the
leader collects.

In the first case, when a follower collects, the Raft cluster
can service this GC without any impact on the latency of the
system. Throughput should also be unaffected, although we
are making the assumption that the cost to bring an unavailable
server up-to-date after a GC does not noticeably affect the
throughput and latency of the cluster. This gives us the model
below for the impact of a follower GC event, where we expect
$T_{schedule}$ to be the network RTT in the common case:

Raft Follower GC Model

$$
\begin{aligned}
\mathrm{LatencyImpact} &= 0 \\
\mathrm{CapacityLoss} &= 0 \\
\mathrm{EventTime} &= T_{schedule} + T_{gc}
\end{aligned}
$$


In the second case, when a leader collects, then we will 
take the additional cost of a fast leader election and proxying 
queued client requests to the new leader. This gives us the 
model below: 


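Raft Leader GC Model

$$
\begin{aligned}
\mathrm{LatencyImpact} &= T_{fastelect} + T_{proxy} \\
&= 1\,\mathrm{RTT} \\
\mathrm{CapacityLoss} &= 0 \\
\mathrm{EventTime} &= T_{fastelect} + T_{proxy} + T_{gc} \\
&= \tfrac{1}{2}\,\mathrm{RTT} + T_{proxy} + T_{gc}
\end{aligned}
$$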


One complication with the leader case, captured by the
$T_{proxy}$ value, is that the leader needs to both forward any
queued requests from clients to the new leader, and also 
should inform clients that a leadership change has occurred. 
The time it takes to do this, and so for how long the leader 
should delay beginning its GC, is highly dependent on the 
system setup. With a small number of known clients, the 
leader can broadcast to them that a new leader has been 
elected. With a larger, or unknown number of clients, a proxy 
layer may be desirable that clients go through. 

5. Evaluation 

To evaluate Blade, we used it in two distributed systems:
first an HTTP cluster behind a load-balancer, and second, the
Raft consensus algorithm. Both systems were described in
Section 4.

For evaluating the performance of the GC system, we use
the standard Go garbage collector since all our systems are
written in the Go programming language. We use Go version
1.4.2, the latest at the time of writing. Go currently uses a
parallel mark-sweep collector, with marking done as a
stop-the-world phase and sweeping done concurrently with the
application (mutator) threads. Because this GC design is far
from state-of-the-art (although still very common in modern
languages), we also compare against the ideal case of no
garbage collection at all. We do this by simply disabling
Go’s garbage collector, so memory is never reclaimed. By
default, Go also runs the collector every two minutes if it
has not run recently, in order to give memory back to the
operating system. For all of the evaluations below we disable
this, as we felt it unfairly favoured Blade by being an
explicit source of synchronization.
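
A GC-Off style configuration can be approximated from within a Go program using the standard library's `debug.SetGCPercent`, as in the minimal sketch below. The helper name `disableGC` is hypothetical, and the sketch makes no attempt to reproduce the exact runtime settings of the experiments; in particular, it does not address the periodic two-minute collection.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// disableGC turns off heap-growth-triggered collections. It returns the
// previous GOGC percentage and a function that restores it.
func disableGC() (prev int, restore func()) {
	prev = debug.SetGCPercent(-1) // a negative percentage disables the collector
	return prev, func() { debug.SetGCPercent(prev) }
}

func main() {
	prev, restore := disableGC()
	defer restore()
	fmt.Println("previous GOGC percentage:", prev)
}
```

Setting the `GOGC=off` environment variable before starting the process has the same effect as the call above.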

All experiments were run over a 10GbE network, using
machines with Intel Xeon E3-1220 4-core CPUs with 64 GB
of RAM and running FreeBSD 10.1. The network RTT was
measured to be 48 µs on average.

Component                    SLOC
Web Application                54
Load-balancer Coordinator     174

Table 2: Source code changes needed to utilize Blade with a web
application cluster.

5.1 HTTP Load-Balancer Performance 

In this section we investigate the performance impact of using 
Blade with an HTTP load-balanced cluster.

We built a simple web application that allows users to
search and retrieve movie information from a backing SQL
database. The web app keeps an in-memory local cache of
recent movie insertions and retrievals to improve performance
by avoiding a DB lookup on each request. The application does
not allow updates to existing records. We run HAProxy [60]
version 1.5 (latest at time of writing) in front of three
servers, using round-robin to load balance requests across
all three.

Adding support for Blade to the web application required 
228 SLOC to be added. Of these, 54 were added to the web 
application itself, while the other 174 were for implementing 
a controller for the load balancer to coordinate the GC at 
each web application server and ensure only one was ever 
collecting at any point in time. As HAProxy already supports 
a TCP interface for enabling and disabling backends, the 
coordinator is only required when enforcing capacity SLAs. 
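
The coordinator's core job, ensuring that only one backend is out of rotation at a time, can be sketched as follows. This is not the 174-line controller itself: the `LoadBalancer` interface, the `Coordinator` type and the `logLB` stand-in are hypothetical names, and mapping `Disable`/`Enable` onto HAProxy's administrative commands is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// LoadBalancer abstracts the two operations the coordinator needs.
// With HAProxy these could map to the "disable server" and "enable server"
// commands of its admin socket; that mapping is an assumption of this sketch.
type LoadBalancer interface {
	Disable(server string) error
	Enable(server string) error
}

// Coordinator serializes collections so at most one backend is out of rotation.
type Coordinator struct {
	mu sync.Mutex // held for the duration of one backend's collection
	lb LoadBalancer
}

// Collect removes server from rotation, runs collect, and re-enables it.
func (c *Coordinator) Collect(server string, collect func()) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if err := c.lb.Disable(server); err != nil {
		return err
	}
	defer c.lb.Enable(server)
	collect() // e.g. wait for trailing requests, then run the GC
	return nil
}

// logLB is a stand-in load balancer that records the commands it receives.
type logLB struct {
	mu  sync.Mutex
	log []string
}

func (l *logLB) Disable(s string) error { l.record("disable " + s); return nil }
func (l *logLB) Enable(s string) error  { l.record("enable " + s); return nil }
func (l *logLB) record(e string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.log = append(l.log, e)
}

func main() {
	lb := &logLB{}
	c := &Coordinator{lb: lb}
	var wg sync.WaitGroup
	for _, s := range []string{"web1", "web2", "web3"} {
		wg.Add(1)
		go func(s string) {
			defer wg.Done()
			c.Collect(s, func() {})
		}(s)
	}
	wg.Wait()
	fmt.Println(lb.log) // each disable is followed by the matching enable
}
```

A single mutex corresponds to a capacity SLA of "at most one server offline"; a counting semaphore would generalize this to larger clusters.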

We also evaluated the latency behaviour of the three
different configurations of the cluster using a fifth machine
to generate load. A CDF of the request tail-latency when
generating 6,000 requests-per-second can be seen in Figure 6.
We ran the experiment for six minutes, during which each node
collects three times. We ran the experiment four times in
total for each configuration and averaged the results. Blade
achieves a result so similar to the GC-Off configuration that
we have to present them on the same line in Figure 6. Overall
performance of each configuration can be seen in Table 3. The
GC-On configuration has tail-latencies far beyond the time the
application is paused by the garbage collector. This appears
to be due to the impact of queues building up, occasional
network retransmissions when buffers overflow, and unfair
servicing of pending sockets by Go. This amplification effect
has previously been explored [36, 63].

                 GC-Off     Blade      GC-On
Mean              2.312     2.311      2.403
Median            2.296     2.294      2.297
Std. Dev.         0.579     0.582      3.395
Max               7.847     7.443    164.206
Avg. GC-Pause     0        12.423     12.339

Table 3: Latency measurements of requests to 3-node HTTP cluster
behind a load-balancer under different GC configurations. Timings
are in milliseconds (ms). Same experiment as Figure 6.

Figure 6: CDF of request tail-latency to 3-node HTTP cluster
behind a load-balancer. Blade and the GC-Off configuration are
so similar that their lines overlap. Load generator is simulating
400 connections to the cluster, sending 6,000 requests-per-second.
Each node has a 1 GB heap for a 150 MB live set, and allocates on
average at 12.5 MB/s. Each node collects 3 times during the 6 minute
experiment.

Figure 7: Request latency of HTTP cluster broken out by backend
server during a collection event at two of the workers. Over four
runs of the latency experiment we observed 36 collections, with 8 of
them overlapping, or 22.2%.

              GC-Off     Blade     GC-On
Requests/s    42,643    51,624    51,983
Std. Dev.      2,213     2,573       672

Table 4: Max throughput of each configuration for the HTTP
cluster. Results are averaged from three runs, each run being 6
minutes long.

During these runs we also observed occasions when the
garbage collection event at one backend server overlapped with
that at another. An example of such an overlap can be seen in
Figure 7, with the latency of requests to each server shown as
the GC event occurs at servers B and C. Out of a total of 36
observed collections across the three servers, 8 overlapped,
or 22.2% of collections. While this is likely high due to the
experimental setup, real-world systems often have external
sources of synchronization that increase the chances of these
overlaps occurring. Examples include the Go default GC policy
of running every two minutes (which we disabled), or sudden
surges of traffic hitting the cluster.
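
Counting overlapping collections from a log of GC events is a small interval-intersection problem. The sketch below shows one way to do it; the `Collection` type and `CountOverlapping` function are hypothetical, and the timestamps in `main` are illustrative values, not measurements from our runs.

```go
package main

import "fmt"

// Collection is one observed GC event; times are in any consistent unit.
type Collection struct {
	Server     string
	Start, End float64
}

// CountOverlapping returns how many collections ran concurrently with a
// collection on a different server.
func CountOverlapping(cs []Collection) int {
	n := 0
	for i, a := range cs {
		for j, b := range cs {
			if i != j && a.Server != b.Server && a.Start < b.End && b.Start < a.End {
				n++
				break // count each collection at most once
			}
		}
	}
	return n
}

func main() {
	// Illustrative timestamps in seconds, not measured data.
	cs := []Collection{
		{"A", 10.000, 10.012},
		{"B", 50.000, 50.012},
		{"C", 50.005, 50.017},
	}
	k := CountOverlapping(cs)
	fmt.Printf("%d of %d collections overlapped\n", k, len(cs))
	// The reported ratio: 8 of 36 collections.
	fmt.Printf("%.1f%%\n", 100*8.0/36.0)
}
```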

Finally, we ran a second experiment on the same cluster
to check the throughput that each configuration is capable of,
the results of which are presented in Table 4. As expected,
Blade doesn’t cause any drop in throughput compared to the
regular GC-On setup, both achieving around 52,000
requests-per-second. The GC-Off configuration, however,
achieves a lower throughput due to the overhead of constantly
requesting fresh memory from the OS, consuming a 34 GB heap by
the end of the experiment. We used these numbers to run one
final latency test, this time generating 40,000
requests-per-second, close to the peak for all three
configurations. The results can be seen in Figure 8. The
slight penalty that the Blade configuration pays at the tail
when under load, from reduced capacity, can be seen when
comparing GC-Off with Blade: Blade is on average 100-300 µs
slower from the 95th percentile on.

5.1.1 Web Application Frameworks 

Using Blade in a web application is generic enough in
nature that we can package it as a library. To demonstrate
this we wrote a Go package that can be included by any web
application that uses the popular Gorilla Web Toolkit [1]. It
is tied specifically to Gorilla because we need to be able to
detect when all trailing requests have completed (or be able
to cancel them if desired). By including this package, any
Gorilla web application that can tolerate a client session
being handled by different servers can benefit from Blade.
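
Detecting when all trailing requests have completed amounts to counting in-flight requests. The sketch below illustrates the idea with plain `net/http` middleware; it is not the Gorilla-based package described above, and the `Tracker` type and its methods are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// Tracker counts in-flight requests so a server can tell when it has drained.
type Tracker struct {
	mu   sync.Mutex
	idle *sync.Cond
	n    int
}

func NewTracker() *Tracker {
	t := &Tracker{}
	t.idle = sync.NewCond(&t.mu)
	return t
}

// Begin records the start of a request.
func (t *Tracker) Begin() {
	t.mu.Lock()
	t.n++
	t.mu.Unlock()
}

// End records the completion of a request and wakes any waiter once idle.
func (t *Tracker) End() {
	t.mu.Lock()
	t.n--
	if t.n == 0 {
		t.idle.Broadcast()
	}
	t.mu.Unlock()
}

// Wait blocks until no requests are in flight.
func (t *Tracker) Wait() {
	t.mu.Lock()
	for t.n > 0 {
		t.idle.Wait()
	}
	t.mu.Unlock()
}

// Middleware wraps next so that every request is counted while it runs.
func (t *Tracker) Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		t.Begin()
		defer t.End()
		next.ServeHTTP(w, r)
	})
}

func main() {
	t := NewTracker()
	h := t.Middleware(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	}))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest("GET", "/", nil))
	t.Wait() // the only request has finished, so this returns at once
	fmt.Println(rec.Code, rec.Body.String())
}
```

In a Blade-style flow, a server would call `Wait` after the load balancer has stopped routing new requests to it and before starting its collection.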

5.2 Raft Performance 

In this section we investigate the performance impact of using 
Blade with the Raft consensus algorithm. As Raft is not a 
standalone system, we use Etcd [16], a replicated key-value 
store with a ZooKeeper [29] inspired API that uses Raft for 
the consensus algorithm. 

Using Blade efficiently in Etcd involved implementing
support for fast-leadership transfers, and also handling GC
upcalls using the algorithms outlined in Figure 4 and Figure 5.
This required 563 lines of code to be changed (largely
additions) in Etcd, with the breakdown shown in Table 5. As
fast-leadership transfers are useful for purposes beyond Blade,
it is fair to count the effort needed to support Blade in Etcd
as 349 SLOC.

Figure 8: CDF of request tail-latency to 3-node HTTP cluster
behind a load-balancer. Load generator is simulating 400
connections to the cluster, sending 40,000
requests-per-second. Each node has a 1.5 GB heap for a 200 MB
live set, and allocates on average at 83.3 MB/s. Each node
collects 17 times during the 6 minute experiment, with an
average pause time of 14.45 ms.

Component             SLOC
Fast-leader switch     214
Blade GC support       349

Table 5: Source code changes needed to utilize Blade in Etcd.

For evaluating the performance of Blade with Raft, we
set up a three node Etcd cluster under three different
configurations: first, running with the standard Go garbage
collector; second, running with the garbage collector
disabled; and finally, using Blade. We ran a single
experiment where we loaded 600,000 keys into Etcd and then
sent 100 requests per second at regular intervals for 10
minutes to the cluster, using a mixture of reads and writes in
a 3:1 ratio. We track the latency of each request after the
initial load of keys. We ran the experiment three times for
each configuration and took the average of the three. In all
configurations the standard deviation between the three runs
was less than 5%. We use a low request rate because, at this
time, Etcd is early in its development and doesn’t support a
high request rate (peaking at around 400 requests/s on our
setup), becoming very unstable anywhere close to its peak.
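
A fixed-rate load generator with a 3:1 read/write mix can be sketched as below. This is an illustration, not the tool used for the experiment: the names `Op`, `opFor` and `run` are hypothetical, the repeating get, get, get, set pattern is an assumption (the mix could equally be randomized), and a slow request delays later ones rather than being issued on schedule.

```go
package main

import (
	"fmt"
	"time"
)

// Op is the kind of request to send.
type Op int

const (
	Get Op = iota
	Set
)

// opFor yields a repeating get, get, get, set pattern: a 3:1 read/write mix.
func opFor(i int) Op {
	if i%4 == 3 {
		return Set
	}
	return Get
}

// run issues n requests at a fixed rate and records each request's latency.
func run(n, perSecond int, do func(Op)) []time.Duration {
	tick := time.NewTicker(time.Second / time.Duration(perSecond))
	defer tick.Stop()
	lat := make([]time.Duration, 0, n)
	for i := 0; i < n; i++ {
		<-tick.C // wait for the next regular interval
		start := time.Now()
		do(opFor(i))
		lat = append(lat, time.Since(start))
	}
	return lat
}

func main() {
	// The callback is where a real client would contact the cluster.
	lat := run(8, 100, func(op Op) {})
	fmt.Println(len(lat), "requests issued")
}
```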

With the GC enabled, this experiment peaks at consuming
473 MB of memory. While very small by modern server
standards, it is sufficient to evaluate our results since
Blade thankfully is not affected by heap size in terms of
latency impact on requests.


          Mean    Median   Std. Dev.      Max
GC-Off    0.532    0.530       0.031    1.127
Blade     0.505    0.499       0.030    1.015
GC-On     0.589    0.517       2.112   95.969

Table 6: Latency measurements of SET requests to Etcd cluster
under different GC configurations. Timings are in milliseconds (ms).



              Blade   GC-Off   GC-On
95th           0.52     0.54    0.54
99th           0.57     0.57    0.56
99.9th         0.62     0.67   28.81
99.99th        0.70     1.03   86.98
99.999th       0.80     1.08   94.22
99.9999th      0.97     1.12   95.96

Table 7: SLA measurements for different Etcd configurations.
Timings are in milliseconds (ms).

The results for set request latencies for all three
configurations are shown in Table 6. Excluding tail-latency,
all three achieve similar performance levels, although Blade
outperforms the other configurations across the board. Blade
achieves a mean of 505 µs and a worst-case of 1.01 ms, GC-Off
a mean of 532 µs and a worst-case of 1.13 ms, and GC-On a mean
of 589 µs and a worst-case of 95.96 ms. The reason Blade even
outperforms the GC-Off configuration is the penalty GC-Off
pays from the extra system calls and lost locality of
requesting new memory rather than ever recycling it. Results
for get requests show the same relation among the three
configurations.

When looking at the tail-latency of each configuration, a
different story emerges. A CDF of the slowest 1% of both get
and set requests can be seen in Figure 9. As expected,
performance of the standard GC configuration has a very long
tail from GC pauses. The results for the GC-Off configuration
and the Blade configuration, however, are nearly identical.
This is expected from the performance model we established
in Section 4.2.1, which showed latency has an expected
increase of 1 network RTT during a GC, a value of 48 µs in our
experimental setup. Indeed, as can be seen in the detailed
breakout of the performance in Table 7, Blade slightly
outperforms the GC-Off configuration. This is due to the
GC-Off configuration paying a penalty from the higher memory
use as mentioned previously, and the 48 µs being within the
tail-latency caused by other sources of jitter such as the OS
scheduler.
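
Percentile rows such as those in Table 7 can be computed from raw latency samples with a nearest-rank percentile. The sketch below is one common definition and is not necessarily the method used to produce the table; the function name and the sample values are illustrative.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Percentile returns the nearest-rank p-th percentile (0 < p <= 100) of
// samples, which must be non-empty. The input slice is not modified.
func Percentile(samples []float64, p float64) float64 {
	s := append([]float64(nil), samples...) // sort a copy
	sort.Float64s(s)
	rank := int(math.Ceil(p * float64(len(s)) / 100))
	if rank < 1 {
		rank = 1
	}
	return s[rank-1]
}

func main() {
	// Illustrative latencies in milliseconds, not measured data.
	lat := []float64{0.50, 0.52, 0.49, 0.51, 0.53, 0.50, 0.48, 0.55, 0.51, 28.8}
	for _, p := range []float64{50, 90, 99} {
		fmt.Printf("p%v = %.2f ms\n", p, Percentile(lat, p))
	}
}
```

A single long pause barely moves the median but dominates the highest percentiles, which is why the GC-On column diverges only from the 99.9th percentile onward.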

6. Discussion & Limitations 

Figure 9: CDF of Etcd replicated key-value store request
latency for all GC configurations.

Understanding Performance. With Blade we set out to
achieve three goals: to bound tail-latency in distributed
systems, to do so using system-specific failure recovery
mechanisms, and to allow the performance impact of garbage
collection to be modelled without knowledge of production
workloads. For the final point, the performance models for
each system in Section 3 demonstrate how Blade can achieve
this. With Raft, for example, we know that the latency impact
of using a garbage collected language will be a single extra
network RTT. As the time to complete any request is at least
one network RTT, using a garbage collector with Blade limits
the tail-latency from GC to twice the mean in the worst case.
Importantly, the impact of GC is now in units comparable
to the rest of the system. Furthermore, as network speeds
improve over time, so will Blade. While our test setup had a
network RTT of 48 µs, 10 Gbps NICs are currently available
that achieve less than 1.5 µs latency at the end-host [40].
Garbage collectors are chasing a continually moving target,
but Blade scales with the performance of the distributed
system.
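
The factor of two follows from a short bound. Writing $L$ for the latency of a request without GC interference and $L_{\mathrm{GC}}$ for its latency when a collection is being handled (both symbols are introduced here for illustration), and assuming $L \ge \mathrm{RTT}$ as stated above:

```latex
L_{\mathrm{GC}} \;\le\; L + \mathrm{RTT} \;\le\; L + L \;=\; 2L
```

For the SET requests of Table 6, $L$ is about 505 µs on average and the RTT is 48 µs, so $L + \mathrm{RTT}$ is roughly 553 µs, well inside the $2L$ bound.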

Limitations. Another important outcome from using Blade
in a distributed system is the changed requirements for the
garbage collector. As Blade deals with the latency impact,
in most situations a concurrent collector will no longer be
the best fit. Instead, a simpler, high-throughput
stop-the-world collector is best suited [22]. These collectors
are already readily available in most languages, unlike
high-performance concurrent collectors.

There are, however, a number of limitations with Blade. 

• First and foremost, Blade is not a universal solution.
We specifically target distributed systems as this is an
important area where low-latency matters and where we’ve had
bad experiences with garbage collected languages. Even
then, Blade will not work for every distributed system, as
it relies on there being a failure recovery mechanism that
can be exploited. Such mechanisms are common, but not
universal.

• Secondly, Blade requires developers to write code and
doesn’t apply transparently. This, however, is by design,
and we believe our results show that the amount of work
needed is low. Even with garbage collection, developers
do not ignore memory management and still apply techniques
such as local allocation caches to improve performance;
Blade is one more technique that can be used.

• Third, Blade takes whole servers offline during a garbage
collection, which may be too large a capacity loss for
some systems. If, for example, Blade were used for an
HTTP cluster with only two servers, then losing 50% of
the cluster capacity during each collection is likely to be
unworkable.

7. Related Work 

Trash Day. Maas et al. have recently done work on
coordinating garbage collection in a distributed system [38].
They look at two different systems, Spark [65] and
Cassandra [33], noting that for Spark, having all nodes
collect at the same time improves performance, while for
Cassandra, staggering collection and routing requests around
nodes can reduce tail latency. They design a run-time system
to provide a general approach to this problem, allowing
different coordination strategies across multiple nodes to be
implemented.

Process Restarts. We have heard of a few different
companies in industry that disable garbage collection, either
completely or for the old generation, and then kill and
restart the process as needed. They will often attempt to
drain requests before restarting the process. This is similar
to Blade but less principled: support is not provided directly
in the programming language, and as it requires that programs
can support arbitrary restarts, it only works for a subset of
the programs that Blade supports. Restarting should also be a
slower operation as it needs to reload state from permanent
storage. As far as we are aware, none of these companies
apply this technique to stateful systems such as Raft.

HTTP Load-balancing. Portillo-Dominguez et al. have
recently done work on HTTP load-balancers in Java to avoid
the impact of garbage collection on latencies [52, 53]. Their
approach is very similar to Blade, modifying a round-robin
routing algorithm to avoid the collecting server. Unlike our
proposal, however, they do not modify the language or RTS;
instead they model the GC and try to predict when it will
collect. Mispredictions mean lower performance than Blade,
and they also have no ability to deal with overlapping
collections. They also deal with a very different level of
performance than we are concerned with, starting with worst
case request latencies in the hundreds of seconds and reducing
that to the tens of seconds. We are instead concerned with
microseconds.

JVM & .NET. The Java Virtual Machine (JVM) supports
two programmable interfaces to the garbage collector. One is
the System.gc() function, which suggests to the RTS that it
start the GC. The other is the Garbage Collection
Notifications API (JGCN) optional extension [46]. JGCN
supports callbacks to the application, like Blade, but it only
supports notifying the application after a collection has
completed. JGCN is intended for performance monitoring and
debugging.

Microsoft’s .NET platform supports an API very similar to
Blade, the Garbage Collection Notifications API (MGCN) [41].
MGCN, like Blade, supports application callbacks before
garbage collection occurs. MGCN doesn’t, however, allow the
application to delay collection, only to start it earlier
than the RTS planned. Microsoft suggests using MGCN in a
manner similar to Blade, but at this time we are unaware of
any reports on its usage or evaluation of it. The lack of
control with MGCN to coordinate nodes, and avoid GC overlaps
at servers, appears to be a concern for some potential users.
The popular Stack Overflow website, for example, chose not to
use MGCN partially for this reason [55].

Concurrent Tracing Garbage Collectors. A vast amount of
work has been done in improving pause times of garbage
collectors. A sample of this was covered in Section 2. Azul
Systems’ Zing GC [4, 15, 23, 30] is one of the best available
today, with pause times in the low milliseconds or
microseconds. This is still one to two orders of magnitude
above what Blade can achieve, and will get worse as faster
networks with under 5 µs RTT become available. The work by
Pizlo et al. on real-time collectors [50, 51], such as Schism,
achieves the lowest pause times we are aware of, capable of
bounds in the tens of microseconds, but suffers from 30% lower
throughput and 20% higher memory consumption compared to
stop-the-world collectors. Other concurrent and real-time
collectors that we are aware of [5, 6, 14, 19, 27, 28, 39, 64]
all perform worse in latency and/or throughput than both Azul
and Schism, with pause times in the tens of milliseconds.
Finally, while many of these systems have STW pauses for some
situations, work such as that of Tomoharu et al. [61] seeks to
address these final cases.

Reference Counting. The best reference counting
collectors have very low and uniform latency impact on an
application, as demonstrated by the Ulterior Reference
Counting [10] collector. However, they have historically
suffered from lower throughput compared to tracing collectors.
The work of Shahriyar et al. [56, 57] has made reference
counting collectors competitive, but does so by incorporating
background tasks and pauses. Unfortunately, Shahriyar et al.
don’t report the latency impact of these changes.

Tail tolerance. Vulimiri et al. proposed [62] an approach to
handling tail-latency for Internet services such as DNS, where
requests were duplicated and sent to multiple servers. Dean
and Barroso proposed and investigated a similar idea [18],
but specifically for addressing tail-latency in data centers.
Servers then either race to fulfil the request, or coordinate
with each other to claim ownership of the request when they
start processing it. Jalaparti applied this idea, as well as
allowing incomplete requests, to build a framework for
constructing data center services [31]. Blade takes a similar
approach to tail tolerant systems, not attempting to reduce
the impact of garbage collection on an individual server, but
avoiding its impact on the end-to-end system. Blade, however,
solves a specific but common problem, garbage collection,
rather than treating servers as a black box. This allows Blade
to be used in situations such as Raft where tail tolerant
systems do not apply, as requests cannot be duplicated and
sent to multiple servers.

8. Conclusion 

Blade is a new approach to garbage collection for a
particular, but large and important, class of programs:
distributed systems. Blade uses the ability of distributed
systems to deal with failure to also handle garbage
collection, treating garbage collection as a partially
predictable and controllable failure. We applied Blade to two
important and common systems, a cluster of web servers and the
Raft consensus algorithm. For the first case, we eliminated
the latency impact of garbage collection, and for the second,
we reduced it to the order of a single network round-trip, or
48 µs in our experiment. As Blade handles the impact of
garbage collectors in distributed systems rather than
attempting to improve them directly, it allows for a different
set of choices when designing the collector. With Blade,
simple, high-throughput designs are preferable to the
complexity of collectors that try to minimize pause times.